EIS V: Blind Spots In AI Safety Interpretability Research

Part 5 of 12 in the Engineer’s Interpretability Sequence.

Thanks to Anson Ho, Chris Olah, Neel Nanda, and Tony Wang for some discussions and comments.

TAISIC = “the AI safety interpretability community”

MI = “mechanistic interpretability”

Most AI safety interpretability work is conducted by researchers in a relatively small number of places, and TAISIC is closely connected by personal relationships and the AI Alignment Forum. Much of the community is focused on a few specific approaches like circuits-style MI, mechanistic anomaly detection, causal scrubbing, and probing. But this is a limited set of topics, and TAISIC might benefit from broader engagement. In the Toward Transparent AI survey (Räuker et al., 2022), we wrote 21 subsections of survey content. Only 1 was on circuits, and only 4 consisted in significant part of work from TAISIC.

I have often heard people in TAISIC explicitly advising more junior researchers to not focus much on reading from the literature and instead to dive into projects. Obviously, experience working on projects is irreplaceable. But not engaging much with the broader literature and community is a recipe for developing insularity and blind spots. I am quick to push back against advice that doesn’t emphasize the importance of engaging with outside work.

Within TAISIC, I have heard interpretability research described as dividing into two sets: mechanistic interpretability and, somewhat pejoratively, “traditional interpretability.” I will be the first to say that some paradigms in interpretability research are unproductive (see EIS III-IV). But I give equal emphasis to the importance of TAISIC not being too parochial. Reasons include maintaining relevance and relationships in the broader community, drawing useful inspiration from past works, making less-correlated bets with what we focus on, and most importantly – not reinventing, renaming, and repeating work that has already been done outside of TAISIC.

TAISIC has reinvented, reframed, or renamed several paradigms

Mechanistic interpretability requires something much like program synthesis, program induction, and/or programming language translation

“Circuits”-style MI is arguably the most popular and influential approach to interpretability in TAISIC. Doing this work requires iteratively (1) generating hypotheses for what a network is doing and then (2) testing how well these hypotheses explain its internal mechanisms. Step 2 may not be that difficult, and causal scrubbing (discussed below) seems like a type of solution that will be useful for it. But step 1 is hard. Mechanistic hypothesis generation is a lot like doing program synthesis, program induction, and/or programming language translation.

Generating mechanistic hypotheses requires synthesizing programs to explain a network using its behavior and/or structure. If a method for this involves synthesizing programs based on the task or I/O from the network, it is a form of program synthesis or induction. And if a method is based on using a network’s structure to write down a program to explain it, it is very similar to programming language translation.
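To make the I/O-based case concrete, here is a minimal sketch of enumerative program synthesis. Everything in it is a hypothetical stand-in chosen for illustration: `black_box` plays the role of a network we can only query, and the two-operator expression language is a made-up DSL. It is not a method from any of the cited works.

```python
import itertools

# Hypothetical "black box" standing in for a network whose I/O we can observe.
def black_box(x: int) -> int:
    return 2 * x + 1

# A made-up DSL: expressions are the input "x", small constants, or (op, left, right).
LEAVES = ["x", 0, 1, 2]
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(expr, x):
    """Evaluate a DSL expression on the integer input x."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if expr == "x" else expr

def enumerate_programs(depth):
    """Yield every expression of nesting depth <= depth (shallow ones first)."""
    if depth == 0:
        yield from LEAVES
        return
    yield from enumerate_programs(depth - 1)
    subs = list(enumerate_programs(depth - 1))
    for op in OPS:
        for left, right in itertools.product(subs, repeat=2):
            yield (op, left, right)

def synthesize(examples, max_depth=2):
    """Return the first enumerated expression consistent with all I/O examples."""
    for expr in enumerate_programs(max_depth):
        if all(evaluate(expr, x) == y for x, y in examples):
            return expr
    return None

examples = [(x, black_box(x)) for x in range(-3, 4)]
program = synthesize(examples)
# The number of candidates grows very quickly with depth, even in this tiny DSL.
counts = [sum(1 for _ in enumerate_programs(d)) for d in range(3)]
```

Even with two operators and four leaves, the candidate count jumps by orders of magnitude with each extra level of nesting, which is a small-scale view of why this kind of search is hard to scale.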

In general, program synthesis and program induction are very difficult and currently fail to scale to large problems. This is well-understood, and these fields are mature enough that there are textbooks on them and on how difficult they are (e.g. Gulwani et al., 2017). Meanwhile, programming language translation is very challenging too. In practice, translating between common languages (e.g. Python and Java) is only partially automatable and relies on many hand-coded rules (Qiu, 1999), and using large language models has had limited success (Roziere et al., 2020). And in cases like these, both the source and target language are discrete and easily interpretable. Since this isn’t the case for neural networks, we should expect translating them into programs to be more difficult.

It is unclear to what extent the relationship between MI and program synthesis, induction, and language translation is understood inside of TAISIC. I do not know of this connection being pointed out before in TAISIC. But understanding this seems important for seeing why MI is difficult and likely to stay that way. MI work in TAISIC has thus far been limited to explaining simple (sub)processes. In cases like these, the program synthesis part of the problem is very easy for a human to accomplish manually. But if a problem can be solved by a program that a human can easily write, then it is not one that we should be applying deep learning to (Rudin, 2018). There will be a much more in-depth discussion of this problem in EIS VI.

If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017; Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.

When highly intelligent systems in the future learn unexpected, harmful behaviors, characterizing the neural circuitry involved will probably not be simple like the current MI work that TAISIC focuses on. We should not expect solving toy MI problems using humans to help with real world MI problems any more than we should expect solving toy program synthesis problems using humans to help with real world program synthesis problems. As a result, automating model-guided hypothesis generation seems to be the only hope that MI research has to be very practically relevant. It may be time for a paradigm shift in TAISIC toward symbolic methods. But the fact that existing neurosymbolic work has not yet scaled or been very useful for many practical problems seems to signify difficulties ahead.

Causal scrubbing, compression, and frivolous subnetworks

The above section discussed how MI can be divided into a program generation component and a hypothesis verification component. And when it comes to hypothesis verification, causal scrubbing (Chan et al., 2022) is an exciting approach. It seems to have the potential to be tractable and valuable for this goal.

If our goal is rigorous MI, causal scrubbing can only be as good as the hypotheses that go into it. Relying on hypotheses that are too general will prevent it from being a very precise tool. And this might be fine. For loose goals such as mechanistic anomaly detection, hypotheses that are merely decent may still be useful for flagging anomalous forward passes through a network. Maybe the production of such decent hypotheses can be automated, and they may do a perfectly fair job of capturing useful mechanisms.

But we should be careful. Some causal scrubbing work has explored using things like gradients, perturbations, ablations, refactorizations, etc. to find parts of the network that can be scrubbed away. But this may not be a very novel or useful approach to hypothesis generation. This particular approach is just a form of network compression. And just because a compressed version of a network seems to accomplish some task does not mean that there is some meaningful mechanism behind it. Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize. So just because a subnetwork in isolation seems to do something doesn’t mean that it really performs that task. This is a type of interpretability illusion.
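As a rough illustration of the pruning point, the sketch below greedily masks out weights of an untrained random network whenever doing so lowers the loss on a toy task. This is a simplified stand-in and not the algorithm used by Ramanujan et al. (2020); the task, the network sizes, and the greedy procedure are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task (an assumption for illustration): label points by the sign of x0.
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)

# A randomly initialized two-layer network. Its weights are never updated.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def loss(mask1, mask2):
    """Mean squared error of the masked random network on the toy task."""
    hidden = np.maximum(X @ (W1 * mask1), 0.0)
    logits = np.clip((hidden @ (W2 * mask2)).ravel(), -50.0, 50.0)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return float(np.mean((probs - y) ** 2))

mask1 = np.ones_like(W1)
mask2 = np.ones_like(W2)
initial_loss = loss(mask1, mask2)

# Greedy pruning: zero out a weight, and keep it zeroed only if the loss drops.
best = initial_loss
for mask in (mask1, mask2):
    for idx in np.ndindex(mask.shape):
        mask[idx] = 0.0
        candidate = loss(mask1, mask2)
        if candidate < best:
            best = candidate
        else:
            mask[idx] = 1.0  # restore weights whose removal does not help
pruned_loss = best
kept_fraction = float((mask1.sum() + mask2.sum()) / (mask1.size + mask2.size))
```

Any improvement here comes purely from selecting a subnetwork of random weights, so a lower `pruned_loss` says nothing about whether the surviving weights implement a meaningful mechanism.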

Polysemanticity and superposition = entanglement

This section is a bit long-winded, but the TL;DR is that TAISIC has done a lot of work on “polysemanticity” and “superposition” in neural networks, but this work is not as novel as it may seem in light of previous work on “entanglement.”

In 2012 Bengio et al. described and studied the “entanglement” of representations among different neurons in networks. To the best of my knowledge, this was the first use of this term in deep learning (although the rough concept goes back to at least Bengio and LeCun (2007)). Since then, there has been a great deal of literature on entanglement, enough for a survey from Carbonneau et al. (2022). See also the disentanglement section from the Toward Transparent AI survey (Räuker et al., 2022). Locatello et al. (2019) describe the goals of this literature as such (parenthetical citations removed for readability):

[Disentangled representations] should contain all the information present in x in a compact and interpretable structure while being independent from the task at hand. They should be useful for (semi-)supervised learning of downstream tasks, transfer, and few shot learning. They should enable us to integrate out nuisance factors, to perform interventions, and to answer counterfactual questions.

Does this sound familiar?

In 2016 Arora et al. described and studied embeddings of words that have multiple semantic meanings. They described these words as “polysemous” and their embeddings as in “superposition.” To the best of my knowledge, this was the first use of “polysemous” and “superposition” to describe embeddings and embedded concepts in deep learning. And to my knowledge, Arora et al. (2016) was the only work prior to TAISIC’s work in 2017 on this topic.

Later on, Olah et al. (2017) characterized neurons which seem to detect multiple unrelated features, and later, Olah et al. (2020) described neurons that seem to respond to multiple unrelated features as “polysemantic.” Olah et al. (2020) write:

Our hope is that it may be possible to resolve polysemantic neurons, perhaps by “unfolding” a network to turn polysemantic neurons into pure features, or training networks to not exhibit polysemanticity in the first place.

Olah et al. (2020) also used the term “superposition”:

Polysemantic neurons…seem to result from a phenomenon we call “superposition” where a circuit spreads a feature across many neurons

And things are even muddier than this. Thorpe (1989) studied how embeddings can densely represent a larger number of distinct concepts than they have dimensions under the term “distributed coding.” And Losch et al. (2019) describe a process for creating a disentangled latent layer as “semantic bottlenecking.” I don’t know how many other terms in various literatures describe concepts similar to entanglement, polysemanticity, superposition, distributed coding, and bottlenecking. And I don’t care much to sift through things thoroughly enough to find out. Instead, the point here is that in light of the literature on entanglement, many of the contributions that TAISIC has made related to polysemanticity and superposition are not very novel.
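For readers who want the shared underlying idea in concrete form, here is a minimal sketch of packing more features than dimensions into an embedding space. The sizes, the random directions, and the dot-product readout are illustrative assumptions, not a reconstruction of any cited paper’s setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 200, 50  # more concepts than dimensions

# Give each feature a random unit-norm direction in the embedding space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def embed(active):
    """Embed a set of active features as the sum of their directions."""
    return directions[list(active)].sum(axis=0)

def readout(embedding):
    """Score every feature by its dot product with the embedding."""
    return directions @ embedding

# With one active feature, its own score is 1 (up to floating-point error)
# and every other feature's score is a strictly smaller overlap.
scores = readout(embed([17]))
recovered = int(np.argmax(scores))

# The directions cannot all be orthogonal, so features interfere with each other.
overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)
max_interference = float(np.abs(overlaps).max())
```

The nonzero `max_interference` is the price of the dense code: any single dimension or neuron carries a mix of many features, which is the phenomenon that the different terms above all describe.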

Olah et al. (2017) and Olah et al. (2020) did not do a thorough job of engaging with the entanglement literature. The only mention of it made by either was from Olah et al. (2020), which wrote without citation:

This is essentially the problem studied in the literature of disentangling representations…At present that literature tends to focus on known features in the latent spaces of generative models.

Although it should be noted that this blog post from 2017 also discussed “superposition.”

Based on my knowledge of the entanglement literature, it is true that most but not all papers using the term study autoencoders. But it is not clear why this matters from the perspective of studying entanglement, polysemanticity, and superposition. Besides, an entangled encoder can be used to extract features for a classifier. This is just a form of “bottlenecking” (Losch et al., 2019), another concept that predates Olah et al. (2020).

To be clear, it seems that the authors of Olah et al. (2017) and Olah et al. (2020) were aware of the entanglement literature, and later, their discussion of related work in Elhage et al. (2022) was much more thorough. But ultimately, Olah et al. (2017) and Olah et al. (2020) did not very thoroughly engage with the entanglement literature. And when Olah et al. (2017) and Olah et al. (2020) were written, the term “entanglement” was much more standard in the deep learning literature than “polysemanticity” and “superposition.”

Details (which I could be wrong about) and speculation (ditto) aside, two different groups of AI researchers have now been working on the same problems under different names, and this isn’t good. The mainstream one uses “entanglement” while TAISIC uses “polysemanticity” and “superposition.” Terminology matters, and it may be the case that TAISIC’s terminology has caused a type of generational isolation among different groups of AI researchers.

There is a lot of useful literature on both supervised and unsupervised entanglement. Instead of listing papers, I’ll refer anyone interested to page 7 of the Toward Transparent AI survey (Räuker et al., 2022). Some researchers in TAISIC may find valuable insights from these works.

One disentanglement method that has come from TAISIC is the softmax linear unit activation function from Elhage et al. (2022). They train a network to be more disentangled using an activation function that makes neurons in the same layer compete for activations. Lateral inhibition being used as a solution to entanglement is nothing new. Again, see page 7 of the Toward Transparent AI survey (Räuker et al., 2022). And a fun fact is that even AlexNet (Krizhevsky et al., 2012) used a form of lateral inhibition called “local response normalization.” But Elhage et al. (2022) engages very little with prior work like this in its discussion of related work. It gives the impression that their technique is more novel than it is.
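To show how close the two ideas are, here is a minimal sketch of both activation schemes acting on a single vector of pre-activations. The softmax linear unit is written as x * softmax(x), which is my understanding of the core of the function in Elhage et al. (2022); the LayerNorm they apply afterward is omitted. The local response normalization follows the formula and default constants from the AlexNet paper, simplified to a 1D vector of channels. The function names and the example input are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def solu(x):
    """Softmax linear unit: scale each pre-activation by its softmax weight in the layer."""
    return x * softmax(x)

def local_response_norm(a, size=5, k=2.0, alpha=1e-4, beta=0.75):
    """AlexNet-style normalization: divide each unit by a function of its neighbors' energy."""
    out = np.empty_like(a)
    half = size // 2
    for i in range(len(a)):
        window = a[max(0, i - half): i + half + 1]
        out[i] = a[i] / (k + alpha * np.sum(window ** 2)) ** beta
    return out

pre_activations = np.array([4.0, 1.0, 0.5, -2.0])
solu_out = solu(pre_activations)
lrn_out = local_response_norm(pre_activations)
```

In both functions, a unit’s output depends on how active the other units in the layer are. With the example input, the softmax version keeps most of the largest pre-activation and pushes the rest toward zero, which is the competition between neurons described above.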

The whole saga involving distributed coding, entanglement, polysemanticity, superposition, and bottlenecking serves as an example of how powerful terminology can be in influencing how the research community understands and approaches problems. This story highlights the importance of engaging thoroughly with previous works and being careful about terminology.

Deceptive alignment ≈ trojans

This discussion will be short because deception will be the main focus of EIS VIII. But spoiler alert: detecting and fixing deception is an almost identical technical problem to detecting and fixing trojans. The only difference is that deceptiveness typically results from an inner alignment failure while trojans are typically implanted with data poisoning which simulates an outer alignment failure. From an engineering standpoint though, this difference is often tenuous. This isn’t a major blind spot per se – many researchers in TAISIC understand this connection and are doing excellent work with trojans. TAISIC should do its best to ensure that this connection is more universally understood.

Unsupervised contrast consistent search = self-supervised contrastive probing

One recent paper from TAISIC presents a way to train a classifier that predicts when models will say dishonest things based on their inner activations (Burns et al., 2022). This type of approach seems promising. But the paper names its method “contrast consistent search” and describes it as “unsupervised,” and I have a nitpick with each. First, “contrast consistent search” is much better described as “contrastive probing.” While the paper refers to the probe as a “probe,” its related work and citations do not engage with the probing literature, even though non-supervised probing has been done before (e.g. Hoyt et al., 2021). Second, this method is not exactly “unsupervised.” It is better described as self-supervised because it requires using paired true and false statements. See A Survey on Contrastive Self-Supervised Learning (Jaiswal et al., 2021) for definitions. In future work, it will be useful to name methods and discuss related work in ways that minimize the possibility of confusion or isolation.
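The role of the pairing is easy to see in the training objective. The sketch below writes out the loss as I understand it from Burns et al. (2022): a consistency term plus a confidence term, computed on probe outputs for the two halves of each contrast pair. The function name and the example probabilities are assumptions for illustration, and the probe itself is left out.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """Loss over contrast pairs; p_pos[i] and p_neg[i] are probe outputs for the two halves of pair i."""
    # Consistency: the two probabilities in a pair should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution where both are 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# No truth labels appear anywhere, but every term needs both halves of a pair.
confident = ccs_loss(np.array([0.9, 0.1]), np.array([0.1, 0.9]))
degenerate = ccs_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

The supervision signal comes from the structure of the paired inputs rather than from labels, which is the sense in which “self-supervised” fits better than “unsupervised.”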

Why so little work on intrinsic interpretability?

There are two basic approaches to interpretability. Intrinsic interpretability techniques involve designing/training models to be easier to study in the first place while post hoc interpretability techniques involve interpreting models after they have been trained. The Toward Transparent AI survey (Räuker et al., 2022) divides its discussion of methods into intrinsic and post hoc ones if you would like to look into this more.

Some great news is that because intrinsic interpretability techniques operate on the model before or during training and post hoc ones operate on it after, combining intrinsic and post hoc methods almost always works well! And given this, it’s odd that with some exceptions (e.g. Elhage et al. (2022)), the large majority of work from TAISIC is on post hoc methods. Maybe it is because of some founder effects plus how TAISIC is still fairly small. In the Toward Transparent AI survey (Räuker et al., 2022) we also speculate about how a lack of benchmarking means a lack of incentive for results-focused work which means a lack of incentive for studying useful synergies between novel combinations of non-novel methods.

But whatever the reason, TAISIC should do more work to study intrinsic interpretability tools and combine them with post hoc analysis. The main reason is the obvious one – that this may significantly improve interpretability results. But this should also be of particular interest to MI researchers. Recall the discussion above about how automating model-guided program synthesis may be necessary if circuits-style MI is to be useful. Designing more intrinsically interpretable systems may be helpful for this. It also seems to be fairly low-hanging fruit. Many intrinsic interpretability methods (e.g. modular architectures, pruning, some regularization techniques, adversarial training) are simple to implement but have rarely been studied alongside post hoc interpretability tools.

Questions

  • Do you know of any other examples from TAISIC of reinvented, reframed, or renamed paradigms? Do you know of other notable examples of this outside of TAISIC?

  • Do you agree or disagree with the claim that program generation is the crucial step in mechanistic interpretability? Do you agree or disagree with the claim that mechanistic interpretability research in TAISIC is currently not addressing this very well?

  • Do you know of any past work discussing how mechanistic interpretability involves program synthesis, induction, and/or language translation?

  • Are you or anyone you know working on neurosymbolic approaches to mechanistic interpretability?

  • Do you know of any deep learning works prior to 2012 that use the term “entanglement”? Do you know of any prior to 2016 that use “polysemy”/“polysemanticity” or “superposition”? Do you know of any other redundant names for “distributed coding,” “entanglement,” “polysemanticity,” “superposition,” or “bottlenecking”?

  • Are you or anyone you know doing interesting work with trojans?

  • Do you have any other hypotheses for why TAISIC doesn’t focus very much on intrinsic interpretability tools?